Pruning Decision Trees with Misclassification Costs
Abstract
We describe an experimental study of pruning methods for decision tree classifiers in two learning situations: minimizing loss and probability estimation. In addition to the two most common methods for error minimization, CART's cost-complexity pruning and C4.5's error-based pruning, we study the extension of cost-complexity pruning to loss and two pruning variants based on Laplace corrections. We perform an empirical comparison of these methods and evaluate them with respect to the following three criteria: loss, mean-squared-error (MSE), and log-loss. We provide a bias-variance decomposition of the MSE to show how pruning affects the bias and variance. We found that applying the Laplace correction to estimate the probability distributions at the leaves was beneficial to all pruning methods, both for loss minimization and for estimating probabilities. Unlike in error minimization, and somewhat surprisingly, performing no pruning led to results that were on par with other methods in terms of the evaluation criteria. The main advantage of pruning was in the reduction of the decision tree size, sometimes by a factor of 10. While no method dominated others on all datasets, even for the same domain different pruning mechanisms are better for different loss matrices. We show this last result using Receiver Operating Characteristic (ROC) curves.

1 Pruning Decision Trees

Decision trees are a widely used symbolic modeling technique for classification tasks in machine learning. The most common approach to constructing decision tree classifiers is to grow a full tree and prune it back. Pruning is desirable because the tree that is grown may overfit the data by inferring more structure than is justified by the training set. Specifically, if there are no conflicting instances, the training-set error of a fully built tree is zero, while the true error is likely to be larger. To combat this overfitting problem, the tree is pruned back with the goal of identifying the tree with the lowest error rate on previously unobserved instances, breaking ties in favor of smaller trees (Breiman, Friedman, Olshen & Stone 1984, Quinlan 1993). Several pruning methods have been introduced in the literature, including cost-complexity pruning (Breiman et al. 1984), reduced-error pruning and pessimistic pruning (Quinlan 1987), error-based pruning (Quinlan 1993), penalty pruning (Mansour 1997), and MDL pruning (Quinlan & Rivest 1989, Mehta, Rissanen & Agrawal 1995, Wallace & Patrick 1993). Esposito, Malerba & Semeraro (1995a, 1995b) have compared several of these pruning algorithms for error minimization. Oates & Jensen (1997) showed that most pruning algorithms create trees that are larger than necessary if error minimization is the evaluation criterion.

Our objective in this paper is different from that of the above-mentioned studies. Instead of pruning to minimize error, we aim to study pruning algorithms with two related goals: loss minimization and probability estimation. Historically, most pruning algorithms have been developed to minimize the expected error rate of the decision tree, assuming that classification errors have the same unit cost. However, in many practical applications one has a loss matrix associated with classification errors (Turney 1997, Fawcett & Provost 1996, Kubat, Holte & Matwin 1997, Danyluk & Provost 1993). In such cases, it may be desirable to prune the tree with respect to the loss matrix or to prune in order to optimize the accuracy of a probability distribution given for each instance.

A probability distribution may be used to adjust the prediction to minimize the expected loss or to supply a confidence level associated with the prediction; in addition, a probability distribution may also be used to generate a lift curve (Berry & Linoff 1997). Pruning for loss minimization or for probability estimation can lead to different pruning behavior than does pruning for error minimization. Figure 1 (left) shows an example where the subtree should be pruned by error-minimization algorithms because the number of errors stays the same (5/100) if the subtree is pruned to a leaf. If the problem has an associated loss matrix specifying that misclassifying someone who is sick as healthy is ten times as costly as classifying someone who is healthy as sick, then we do not want the pruning algorithm to prune this subtree. For this loss matrix, pruning the tree leads to a loss of 50, whereas retaining the tree leads to a loss of 5 (the left-hand leaf would classify instances as sick to minimize the expected loss). Figure 1 (right) illustrates the reverse situation: error-based pruning would retain the subtree, whereas cost-based pruning would prune it. Given the same loss matrix as in the first example, each of the leaf nodes would classify an example as sick, and a pruning algorithm that minimizes loss should collapse them (if pruning attempts to minimize loss, then if all children are labeled the same, they should be pruned).

Fig. 1. The left figure shows a tree that should be pruned by error-minimization algorithms (pruning does not change the number of errors) but not by loss-minimization algorithms with a 10 to 1 loss for classifying sick as healthy versus vice versa. The right tree shows the opposite situation, where error-minimization algorithms should not prune, yet loss minimization with a 10 to 1 loss should prune, since both leaves should be labeled "sick."
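To make the Figure 1 (left) arithmetic concrete, the short Python sketch below compares the total training-set loss of collapsing the subtree to a single leaf against keeping it, under the 10 to 1 loss matrix. The leaf counts are an assumed instantiation that is merely consistent with the 5/100 errors and the 50-versus-5 losses quoted above; they are not taken from the figure itself.

```python
# Illustrative sketch of the Figure 1 (left) arithmetic. The leaf counts below are
# assumptions chosen to be consistent with the text (5 errors out of 100 either way,
# loss of 50 if pruned versus 5 if kept); they are not read off the actual figure.

LOSS = {  # LOSS[true_class][predicted_class]: sick-as-healthy costs 10, healthy-as-sick costs 1
    "healthy": {"healthy": 0.0, "sick": 1.0},
    "sick":    {"healthy": 10.0, "sick": 0.0},
}

def leaf_loss(counts, label):
    """Total training-set loss at a leaf labeled `label`, given its class counts."""
    return sum(n * LOSS[true][label] for true, n in counts.items())

def best_label(counts):
    """Leaf label that minimizes the training-set loss."""
    return min(LOSS, key=lambda label: leaf_loss(counts, label))

# Assumed subtree: the left leaf isolates the 5 sick instances (plus 5 healthy ones).
left = {"healthy": 5, "sick": 5}
right = {"healthy": 90, "sick": 0}
pruned = {"healthy": 95, "sick": 5}   # counts if the subtree is collapsed to one leaf

keep = leaf_loss(left, best_label(left)) + leaf_loss(right, best_label(right))
prune = leaf_loss(pruned, best_label(pruned))
print(keep, prune)   # 5.0 50.0 -> under this loss matrix the subtree should not be pruned
```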
These examples illustrate that it is of critical importance that the pruning criterion be based on the overall learning-task evaluation criterion. In this paper, we investigate the behavior of several pruning algorithms. In addition to the two most common methods for error minimization, cost-complexity pruning (Breiman et al. 1984) and error-based pruning (Quinlan 1993), we study the extension of cost-complexity pruning to loss and two pruning variants based on Laplace corrections (Cestnik 1990, Good 1965). We perform an empirical comparison of these methods and evaluate them with respect to the following criteria: loss under two matrices, average mean-squared-error (MSE), and average log-loss. While it is expected that no method dominates another on all problems, we found that adjusting the probability distributions at the leaves using Laplace was beneficial to all methods. While no method dominated others on all datasets, even for the same domain different pruning mechanisms are better for different loss matrices. We show this last result using Receiver Operating Characteristic (ROC) curves (Provost & Fawcett 1997).

2 The Pruning Algorithms and Evaluation Criteria

2.1 Probability Estimation and Loss Minimization at the Leaves

A decision tree can be used to estimate a probability distribution over the label values rather than to make a single prediction. Such trees are sometimes called class probability trees (Breiman et al. 1984). Several methods have been proposed to predict class distributions, including frequency counts, Laplace corrections, and smoothing (Breiman et al. 1984, Buntine 1992, Oliver & Hand 1995). In our experiments, we use the former two methods. The frequency-counts method simply predicts a distribution based on the counts at the leaf into which the test instance falls. Frequency counts are sometimes unreliable because the tree was built to separate the classes, and the probability estimates tend to be extreme at the leaves (e.g., zero probabilities). The Laplace correction method biases the probability towards a uniform distribution. Specifically, if a node has m instances, c of which are from a given class, in a k-class problem, the probability assigned to the class is (c + 1)/(m + k) (Good 1965, Cestnik 1990). Given a probability distribution and a loss matrix, it is simple to compute the class with the minimal expected loss by multiplying the probability distribution vector by the loss matrix. When misclassification costs are equal, minimizing the expected loss is equivalent to choosing the majority class (ties can be broken arbitrarily).
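The sketch below illustrates the two leaf-level operations just described: the Laplace-corrected estimate (c + 1)/(m + k) and the choice of the class with minimal expected loss by combining the probability vector with a loss matrix. The function names and the example counts are our own illustrative choices.

```python
# Minimal sketch of the leaf-level estimates from Section 2.1.

def laplace_probabilities(counts):
    """Laplace-corrected class probabilities: (c + 1) / (m + k) for each class."""
    m, k = sum(counts.values()), len(counts)
    return {cls: (n + 1) / (m + k) for cls, n in counts.items()}

def min_expected_loss_class(probs, loss):
    """Predict the class minimizing sum_true P(true) * loss[true][predicted]."""
    def expected_loss(pred):
        return sum(p * loss[true][pred] for true, p in probs.items())
    return min(probs, key=expected_loss)

# Illustrative 10-to-1 loss matrix, indexed as loss[true][predicted].
LOSS = {
    "healthy": {"healthy": 0.0, "sick": 1.0},
    "sick":    {"healthy": 10.0, "sick": 0.0},
}

# A leaf with 20 healthy and 10 sick instances (the counts reused in Section 2.2).
probs = laplace_probabilities({"healthy": 20, "sick": 10})
print(probs)                                  # {'healthy': 0.65625, 'sick': 0.34375}
print(min_expected_loss_class(probs, LOSS))   # 'sick' (expected loss 0.656 versus 3.438)
```

With equal misclassification costs, the same selection reduces to picking the majority class, as noted above.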
2.2 Pruning for Error and Loss Minimization

Most pruning algorithms perform a post-order traversal of the tree, replacing a subtree by a single leaf node when the estimated error of the leaf replacing the subtree is lower than that of the subtree. The crux of the problem is to find an honest estimate of error (Breiman et al. 1984), which is defined as one that is not overly optimistic for a tree that was built to minimize errors in the first place. The resubstitution error (error rate on the training set) does not provide a suitable estimate, because a leaf node replacing a subtree will never have fewer errors on the training set than the subtree.

The two most commonly used pruning algorithms for error minimization are error-based pruning (Quinlan 1993) and cost-complexity pruning (Breiman et al. 1984). The error-based pruning algorithm used in C4.5 estimates the error of a leaf by computing a statistical confidence interval of the resubstitution error (error on the training set for the leaf), assuming an independent binomial model, and selecting the upper bound of the confidence interval. The width of the confidence interval is a tunable parameter of the algorithm. The estimated error for a subtree is the sum of the errors for the leaves underneath it. Because leaves have fewer instances than their parents, their confidence intervals are wider, possibly leading to larger estimated errors; hence they may be pruned.

We were unable to generalize C4.5's error-based pruning based on confidence intervals to take losses into account. The naive idea of computing a confidence interval for each probability and computing the losses based on the upper bound of the interval for each class yields a distribution that does not add to one. Experiments we ran on some variants (e.g., normalizing the probabilities) did not perform well. Instead, we decided to use a Laplace-based pruning method.

The Laplace-based pruning method we introduce here has a motivation similar to that of C4.5's error-based pruning. The leaf distributions are computed using the Laplace correction described above. This correction makes the distribution at the leaves more uniform and less extreme. Given a node, we can compute the expected loss using the loss matrix. The expected loss of a subtree is the sum of the expected losses of its leaves. Figure 2 (left) shows an example of Laplace-based pruning with a 10 to 1 loss matrix. In this case each of the children predicts sick in order to minimize the expected loss. To see why, consider the right-hand child, for which we have 20 healthy and 10 sick instances. After the Laplace correction, we have the distribution 21/32 = 0.6562 for class healthy and 11/32 = 0.3438 for class sick. If the loss from misclassifying a healthy case as sick is 1 and the loss from misclassifying sick as healthy is 10, then the expected loss per instance at this node is 0.6562 if it is labeled sick, whereas it is 3.438 if it is labeled healthy. Because the parent has a lower loss than the sum of the losses of the children (20.0 versus 20.52), the subtree will be pruned.

Fig. 2. Example of Laplace-based pruning. On the left is an example where the parent has a lower loss than its children, so the subtree would be pruned to a leaf. On the right is an example of an unintuitive behavior of Laplace-based pruning: the parent would classify instances as healthy while both children would classify them as sick. For each node, the expected loss for each class is computed by multiplying the number of instances at the node by the estimated probability for the class times the loss given the leaf's prediction.

When coupled with loss matrices, the Laplace correction sometimes leads to unintuitive pruning behavior. Consider Figure 2 (right). Each of the leaves would predict sick given the 10 to 1 loss matrix described above. The expected loss of the children is 16.2 each when they predict sick, whereas the expected loss of the parent is 28.44 when it predicts healthy. Hence, unlike in error-based pruning, if all children have the same label, the parent may predict a different label.

The cost-complexity pruning (CCP) algorithm used in CART penalizes the estimated error based on the subtree size. Specifically, the error estimate assigned to a subtree is the resubstitution error plus a factor alpha times the subtree size. An efficient search algorithm can be used to compute all the distinct alpha values that change the tree size, and the parameter is chosen to minimize the error on a holdout sample or using cross-validation. Once the optimal value of alpha is found, the entire training set is used to grow the tree, and it is pruned using the alpha previously found. In our experiments, we used the holdout method, holding back 20% of the training set to estimate the best alpha parameter. Cost-complexity pruning extends naturally to loss matrices: instead of estimating the error of a subtree, we estimate its loss (or cost), using the resubstitution loss and penalizing by the size of the tree times the alpha factor, as in error-based CCP.
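As a sketch of the Laplace-based pruning test, the code below computes the expected loss of a node acting as a leaf (its instance count times the minimal per-instance expected loss under the Laplace-corrected distribution) and prunes when the parent's loss does not exceed the sum over its children. The right-hand-child counts (20 healthy, 10 sick) come from the text; the left-child and parent counts are invented for illustration and do not reproduce the 20.0-versus-20.52 figures of Figure 2.

```python
# Sketch of the Laplace-based pruning test from Section 2.2 (illustrative counts).

LOSS = {  # LOSS[true][predicted], the 10-to-1 matrix used in the text
    "healthy": {"healthy": 0.0, "sick": 1.0},
    "sick":    {"healthy": 10.0, "sick": 0.0},
}

def node_expected_loss(counts):
    """m times the minimal per-instance expected loss under the Laplace-corrected distribution."""
    m, k = sum(counts.values()), len(counts)
    probs = {cls: (n + 1) / (m + k) for cls, n in counts.items()}     # Laplace correction
    per_instance = {pred: sum(p * LOSS[true][pred] for true, p in probs.items())
                    for pred in LOSS}
    return m * min(per_instance.values())

def should_prune(parent_counts, children_counts):
    """Prune when the parent-as-leaf loss does not exceed the children's total loss."""
    children_loss = sum(node_expected_loss(c) for c in children_counts)
    return node_expected_loss(parent_counts) <= children_loss

# Right-hand child from the text: per-instance loss 0.6562 predicting sick versus
# 3.438 predicting healthy, so it predicts sick. Left child and parent are assumed.
right = {"healthy": 20, "sick": 10}
left = {"healthy": 2, "sick": 3}
parent = {"healthy": 22, "sick": 13}
print(should_prune(parent, [left, right]))   # True -> replace the subtree by a leaf
```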
2.3 Pruning with Respect to Probability Estimates

One possible objective for inducing a decision tree is to use it as a probability tree, namely, to predict probability distributions. Such a tree has several advantages: it can give a confidence level for its predictions; it can be used with different loss matrices, computing the best label for each instance using the probability distribution and the loss matrix at hand; and it can be used to generate a lift curve (Berry & Linoff 1997).

The KL-pruning method that we introduce prunes only if the distribution of a node and the distributions of its children are similar. Specifically, the method is based on the Kullback-Leibler (KL) distance (Cover & Thomas 1991) between the parent distribution and those of its children. For each node, we estimate the class distribution using the Laplace correction detailed in Section 2.1. If q_c is the parent's probability for class c and p_ic is the i-th child's probability for class c, the KL distance for child i is

distance_i = sum_c p_ic log(p_ic / q_c).

This gives us a distance value for each child node. We then compute a weighted average distance over the children,

distance = sum_i distance_i * (m_i / m),

where m is the number of instances observed at the parent node and m_i is the number of instances observed at child node i. If the average distance is less than a given threshold factor (a parameter of the algorithm), then the subtree is pruned. Because the Laplace correction is used, the probabilities are never zero (although the method is still valid if frequency counts are used, because a zero probability for a class in the parent forces a zero probability for that same class in the child). In these experiments, we set the threshold to 0.01 based on initial experiments. In other experiments, we have noted that pruning performance can be radically improved when this parameter is customized to the particular dataset. However, we did not attempt to fine-tune this parameter for the specific datasets used in this paper.
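A minimal sketch of the KL-pruning test under the definitions above: Laplace-corrected distributions for the parent and each child, the per-child distance sum_c p_ic log(p_ic / q_c), and the instance-weighted average compared against the threshold. The class counts are invented for illustration, and we use the natural logarithm since the text does not fix a base.

```python
import math

# Sketch of the KL-pruning test from Section 2.3 (illustrative class counts).

THRESHOLD = 0.01   # the value used in the paper's experiments

def laplace(counts):
    """Laplace-corrected class distribution, (c + 1) / (m + k)."""
    m, k = sum(counts.values()), len(counts)
    return {cls: (n + 1) / (m + k) for cls, n in counts.items()}

def kl_prune(parent_counts, children_counts, threshold=THRESHOLD):
    """Prune if the weighted average KL distance from parent to children is below threshold."""
    q = laplace(parent_counts)                  # parent distribution q_c
    m = sum(parent_counts.values())
    average = 0.0
    for child in children_counts:
        p = laplace(child)                      # child distribution p_ic
        m_i = sum(child.values())
        distance_i = sum(p[c] * math.log(p[c] / q[c]) for c in p)
        average += distance_i * m_i / m         # weight each child by its share of instances
    return average < threshold

# Children whose distributions closely match the parent's, so the subtree is pruned.
left = {"healthy": 48, "sick": 12}
right = {"healthy": 35, "sick": 10}
parent = {"healthy": 83, "sick": 22}
print(kl_prune(parent, [left, right]))   # True
```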
2.4 Evaluation Criteria

For any given learning task there is a domain-specified evaluation criterion. The majority of reported research in decision trees has assumed that the learning evaluation criterion is to minimize the expected error of the classifier. In cases where a loss matrix is specified, the average loss for a test set is the average of the losses over the instances in the test set as determined by the loss matrix. Algorithms that predict probability distributions can easily be generalized to take the loss matrix into account by multiplying the two and predicting the class with the smallest expected loss.

In many practical applications, it is important not only to classify each instance correctly or to minimize loss, but also to give a probability distribution over the classes. To measure the error between the true probability distribution and the predicted distribution, the mean-squared error (MSE) can be used (Breiman et al. 1984, Definition 4.18). The MSE is the sum of the squared differences between the probability p(c) assigned by the classifier to each class c and the true probability f(c):

MSE = sum_c (f(c) - p(c))^2.

Because test sets supplied in practice have a single label per instance, one class has probability 100% and the others have zero. The MSE is therefore bounded between zero and two (Kohavi & Wolpert 1996), so in this paper we use half the MSE as a "normalized MSE" so that it is a number between 0% and 100%. A classifier that makes a single prediction is viewed as assigning a probability of one to the predicted class and zero to the other classes; under those conditions, the average normalized MSE is the same as the classification error.

A different measure of probability estimates is log-loss, which is sometimes claimed to be a natural measure of the goodness of probability estimates (Bernardo & Smith 1993, Mitchell 1997). The loss assigned to a probability distribution p for an instance whose true probability distribution is f is the weighted sum of minus the log of the probability p(c) assigned by the classifier to class c, where the weighting is done by the true probability of class c:

log-loss = -sum_c f(c) log2 p(c).
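To make the two probability-quality measures concrete, the sketch below evaluates a single test instance whose observed label is treated as a one-hot true distribution, as described above; the example prediction is the Laplace-corrected 21/32 versus 11/32 leaf from Section 2.2. The function names are ours.

```python
import math

# Sketch of the per-instance evaluation measures from Section 2.4.

def normalized_mse(true_dist, pred_dist):
    """Half of MSE = sum_c (f(c) - p(c))^2, so the value lies between 0 and 1."""
    return 0.5 * sum((f - pred_dist.get(c, 0.0)) ** 2 for c, f in true_dist.items())

def log_loss(true_dist, pred_dist):
    """-sum_c f(c) * log2 p(c); with a one-hot f(c) this is -log2 of the probability
    assigned to the observed class."""
    return -sum(f * math.log2(pred_dist[c]) for c, f in true_dist.items() if f > 0)

# A test instance labeled "sick" (one-hot true distribution) and a leaf that predicts
# 21/32 healthy, 11/32 sick.
f = {"healthy": 0.0, "sick": 1.0}
p = {"healthy": 21 / 32, "sick": 11 / 32}

print(normalized_mse(f, p))   # 0.5 * (0.656^2 + 0.656^2) ~= 0.431
print(log_loss(f, p))         # -log2(0.344) ~= 1.54
```

Averaging these per-instance values over a test set gives the average normalized MSE and average log-loss used as evaluation criteria in the experiments.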